Hi! I’m Olivia :)

Location of Le Pont-de-Beauvoisin on a map of France

Location of Palmerston North on a map of New Zealand

Hi! I’m Olivia :)

Statistical scientist at Plant & Food Research – Omics analyses and multi-omics integration

Multivariate analyses

Visualisations

R packages

The issue with complex analyses in R

Schema of the analysis for one of my thesis chapters

The issue with complex analyses in R

Project folder:

thesis/chapter4_code 
├── genomics_data_analysis
│   ├── 00_genomics_wrangling.Rmd
│   ├── 01_genomics_filtering.Rmd
│   └── 02_genomics_eda.Rmd
├── transcriptomics_data_analysis
│   ├── 00_transcriptomics_wrangling.Rmd
│   ├── 01_transcriptomics_normalisation.Rmd
│   ├── 02_transcriptomics_diff_expr.Rmd
│   └── 03_transcriptomics_wgcna.Rmd
...
  • Input data has changed; what do I need to re-run?

  • In which order do I need to run these scripts?

  • If I change one line of text in my report, how long will it take to re-generate the output?

The {targets} package

The project

A typical data science project would include:

  • cleaning the data
  • creating some visualisations for EDA
  • fitting a model
  • generating a report for the stakeholders  

We will analyse the palmerpenguins dataset: size measurements for three penguin species observed in the Palmer Archipelago (Antarctica)

Artwork by @allison_horst

Start coding!

I have written some code…

analysis/first_script.R
library(tidyverse)
library(janitor)
library(ggbeeswarm)
library(here)

# Reading and cleaning data
penguins_df <- read_csv(here("data/penguins_raw.csv"), show_col_types = FALSE) |> 
  clean_names() |> 
  mutate(
    species = word(species, 1),
    year = year(date_egg),
    sex = str_to_lower(sex),
    year = as.integer(year),
    body_mass_g = as.integer(body_mass_g),
    across(where(is.character), as.factor)
  ) |> 
  select(
    species,
    island,
    year,
    sex,
    body_mass_g,
    bill_length_mm = culmen_length_mm,
    bill_depth_mm = culmen_depth_mm,
    flipper_length_mm
  ) |> 
  drop_na()

## Violin plot of body mass per species and sex
penguins_df |> 
  ggplot(aes(x = species, colour = sex, fill = sex, y = body_mass_g)) +
  geom_violin(alpha = 0.3, scale = "width") +
  geom_quasirandom(dodge.width = 0.9) +
  scale_colour_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal()

## Violin plot of flipper length per species and sex
penguins_df |> 
  ggplot(aes(x = species, colour = sex, fill = sex, y = flipper_length_mm)) +
  geom_violin(alpha = 0.3, scale = "width") +
  geom_quasirandom(dodge.width = 0.9) +
  scale_colour_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal()

## Scatter plot of bill length vs depth, with species and sex
penguins_df |> 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, colour = species, shape = sex)) +
  geom_point() +
  scale_colour_brewer(palette = "Set2") +
  theme_minimal()

I have written some code…

Good start, but often:

  • many more steps in the analysis → the script becomes long and convoluted

  • harder to get an overview of the analysis, and to find things

  • don’t want to re-run all the code every time I make a change

Solution

Turn the code into a {targets} pipeline!

Step 1: Turn your code into functions

From:

# Reading and cleaning data
penguins_df <- read_csv(
  here("data/penguins_raw.csv"), 
  show_col_types = FALSE
) |> 
  clean_names() |> 
  mutate(
    species = word(species, 1),
    year = year(date_egg),
    sex = str_to_lower(sex),
    year = as.integer(year),
    body_mass_g = as.integer(body_mass_g),
    across(where(is.character), as.factor)
  ) |> 
  select(
    ## all relevant columns
  ) |> 
  drop_na()

To:

read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
  janitor::clean_names() |> 
  dplyr::mutate(
    species = stringr::word(species, 1),
    year = lubridate::year(date_egg),
    sex = stringr::str_to_lower(sex),
    year = as.integer(year),
    body_mass_g = as.integer(body_mass_g),
    dplyr::across(
      dplyr::where(is.character), 
      as.factor
    )
  ) |> 
  dplyr::select(
    ## all relevant columns
  ) |> 
  tidyr::drop_na()
}

Step 1: Turn your code into functions

Don’t forget to document your functions! ({roxygen}-style)

#' Read and clean data
#' 
#' Reads in the penguins data, renames and selects relevant columns. The
#' following transformations are applied to the data: 
#' * only keep species common name
#' * extract observation year
#' * remove rows with missing values
#' 
#' @param file Character, path to the penguins data .csv file.
#' @returns A tibble.
read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
  janitor::clean_names() |> 
  dplyr::mutate(
    ## modifying columns
  ) |> 
  dplyr::select(
    ## all relevant columns
  ) |> 
  tidyr::drop_na()
}

Step 1: Turn your code into functions

Improved script:

R/helper_functions.R
#' Read and clean data
#' 
#' ...
read_data <- function(file) { ... }

#' Violin plot of variable per species and sex
#' 
#' ...
violin_plot <- function(df, yvar) { ... }

#' Scatter plot of bill length vs depth
#' 
#' ...
plot_bill_length_depth <- function(df) { ... }
analysis/first_script.R
library(here)

source(here("R/helper_functions.R"))

penguins_df <- read_data(
  here("data/penguins_raw.csv")
)

body_mass_plot <- violin_plot(
  penguins_df, 
  body_mass_g
)

flipper_length_plot <- violin_plot(
  penguins_df, 
  flipper_length_mm
)

bill_scatterplot <- plot_bill_length_depth(
  penguins_df
)

Step 2: Turn your main script into a {targets} pipeline!

From:

library(here)

source(here("R/helper_functions.R"))

penguins_df <- read_data(here("data/penguins_raw.csv"))

body_mass_plot <- violin_plot(penguins_df, body_mass_g)

flipper_length_plot <- violin_plot(penguins_df, flipper_length_mm)

bill_scatterplot <- plot_bill_length_depth(penguins_df)

Step 2: Turn your main script into a {targets} pipeline!

To:

library(targets)
library(here)

source(here("R/helper_functions.R"))

list(
  tar_target(penguins_raw_file, here("data/penguins_raw.csv"), format = "file"),
  
  tar_target(penguins_df, read_data(penguins_raw_file)),

  tar_target(body_mass_plot, violin_plot(penguins_df, body_mass_g)),

  tar_target(flipper_length_plot, violin_plot(penguins_df, flipper_length_mm)),

  tar_target(bill_scatterplot, plot_bill_length_depth(penguins_df))
)

Step 2: Turn your main script into a {targets} pipeline!

Aside: where should my targets script live?

  • by default, the pipeline script is the _targets.R file in the main directory

  • to choose a custom folder and file name, you need to set the {targets} configuration

 

In the console, run (from main directory):

targets::tar_config_set(script = "analysis/_targets.R", store = "analysis/_targets")

Will create a _targets.yaml file:

_targets.yaml
main:
  script: analysis/_targets.R
  store: analysis/_targets
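To confirm the configuration is picked up, `targets::tar_config_get()` reads a setting back from the _targets.yaml file:

```r
# Check which script and store the pipeline will use
targets::tar_config_get("script")
# returns "analysis/_targets.R"
targets::tar_config_get("store")
# returns "analysis/_targets"
```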

Visualise your pipeline

In the console, run:

targets::tar_visnetwork()

Execute your pipeline

In the console, run:

targets::tar_make()

here() starts at C:/Users/hrpoab/Desktop/GitHub/palmerpenguins_analysis
> dispatched target penguins_raw_file
o completed target penguins_raw_file [0 seconds]
> dispatched target penguins_df
o completed target penguins_df [0.85 seconds]
> dispatched target body_mass_plot
o completed target body_mass_plot [0.16 seconds]
> dispatched target bill_scatterplot
o completed target bill_scatterplot [0.02 seconds]
> dispatched target flipper_length_plot
o completed target flipper_length_plot [0.02 seconds]
> ended pipeline [1.31 seconds]

Get the pipeline results

targets::tar_read(bill_scatterplot)
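An alternative to `tar_read()` is `targets::tar_load()`, which assigns one or more targets into the current environment by name; handy when exploring several results interactively:

```r
# tar_read() returns a target's value; tar_load() assigns it by name
targets::tar_load(bill_scatterplot)  # creates `bill_scatterplot` in the session
bill_scatterplot                     # print the ggplot object
```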

Change in a step

Hi Olivia,

Great work! Just a minor comment, could you change the colours in the bill length/depth scatter-plot? It’s hard to see the difference between the species.

 

R/helper_functions.R
plot_bill_length_depth <- function(df) {
  df |> 
    ggplot2::ggplot(
      ggplot2::aes(
        x = bill_length_mm, 
        y = bill_depth_mm, 
        colour = species, 
        shape = sex
        )
    ) +
    ggplot2::geom_point() +
    ggplot2::scale_colour_brewer(palette = "Set2") +
    ggplot2::theme_minimal()
}

Change in a step

Hi Olivia,

Great work! Just a minor comment, could you change the colours in the bill length/depth scatter-plot? It’s hard to see the difference between the species.

 

R/helper_functions.R
plot_bill_length_depth <- function(df) {
  df |> 
    ggplot2::ggplot(
      ggplot2::aes(
        x = bill_length_mm, 
        y = bill_depth_mm, 
        colour = species, 
        shape = sex
        )
    ) +
    ggplot2::geom_point() +
    ggplot2::scale_colour_brewer(palette = "Set1") +
    ggplot2::theme_minimal()
}

Change in a step

targets::tar_visnetwork()

Change in a step

targets::tar_make()

here() starts at C:/Users/hrpoab/Desktop/GitHub/palmerpenguins_analysis
v skipped target penguins_raw_file
v skipped target penguins_df
v skipped target body_mass_plot
> dispatched target bill_scatterplot
o completed target bill_scatterplot [0.35 seconds]
v skipped target flipper_length_plot
> ended pipeline [0.53 seconds]

Change in a step

targets::tar_read(bill_scatterplot)

Change in the data

Hi Olivia,

Oopsie! We realised there was a mistake in the original data file. Here is the updated spreadsheet, could you re-run the analysis with this version?

targets::tar_visnetwork()

Change in the data – using {assertr} for data checking

Problem: a change in the data format (column names or types, range of values, etc.) might cause the pipeline to fail

Solution

Use {assertr} to check that assumptions about the data are correct!

Using {assertr} for data checking

 

Adding the following checks:

read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
    janitor::clean_names() |> 
    ## data cleaning code
}

Using {assertr} for data checking

 

Adding the following checks:

  • Are all necessary columns present in the dataset?
read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
    janitor::clean_names() |> 
    assertr::verify(
      assertr::has_all_names(
        "species", "island", "date_egg", "sex",
        "body_mass_g", "culmen_length_mm",
        "culmen_depth_mm", "flipper_length_mm"
      )
    ) |>
    ## data cleaning code
}

Using {assertr} for data checking

 

Adding the following checks:

  • Are all necessary columns present in the dataset?

  • Is the body mass already an integer?

read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
    janitor::clean_names() |> 
    assertr::verify(
      assertr::has_all_names(
        "species", "island", "date_egg", "sex",
        "body_mass_g", "culmen_length_mm",
        "culmen_depth_mm", "flipper_length_mm"
      )
    ) |>
    assertr::assert(rlang::is_integerish, body_mass_g) |>
    ## data cleaning code
}

Using {assertr} for data checking

 

Adding the following checks:

  • Are all necessary columns present in the dataset?

  • Is the body mass already an integer?

  • Are all flipper length values positive (after removing NAs)?

read_data <- function(file) {
  readr::read_csv(file, show_col_types = FALSE) |> 
    janitor::clean_names() |> 
    assertr::verify(
      assertr::has_all_names(
        "species", "island", "date_egg", "sex",
        "body_mass_g", "culmen_length_mm",
        "culmen_depth_mm", "flipper_length_mm"
      )
    ) |>
    assertr::assert(rlang::is_integerish, body_mass_g) |>
    ## data cleaning code
    assertr::verify(flipper_length_mm > 0)
}
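When a check fails, {assertr} stops with an error describing the offending rows instead of letting bad data flow further down the pipeline. A toy illustration on made-up data (not the penguins file):

```r
library(assertr)

# Made-up data with one impossible flipper length
toy <- data.frame(flipper_length_mm = c(190, 210, -5))

# The -5 violates the predicate, so verify() signals an error
# reporting the failing row rather than returning the data.
toy |> 
  verify(flipper_length_mm > 0)
```

Inside a {targets} pipeline, that error stops `tar_make()` at the `penguins_df` target, so downstream plots are never built from invalid data.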

Note

More on data checking in Danielle Navarro’s blog post on Four ways to write assertion checks in R.
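For lightweight checks without an extra dependency, base R's `stopifnot()` can express similar assertions. A minimal sketch (the `check_columns` helper is hypothetical, not from the deck):

```r
# Base-R equivalent of the column-presence check: stop early
# with an error if any required column is missing.
check_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  stopifnot(length(missing_cols) == 0)
  df  # return the data unchanged so the helper can sit in a pipe
}

df <- data.frame(species = "Adelie", body_mass_g = 3700L)
out <- check_columns(df, c("species", "body_mass_g"))  # passes, returns df
```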

Writing a report with Quarto

reports/palmerpenguins_report.qmd
---
title: "Analysis of penguin measurements from the palmerpenguins dataset"
author: "Olivia Angelin-Bonnet"
date: today
format:
  docx:
    number-sections: true
---

```{r setup}
#| include: false

library(knitr)

opts_chunk$set(echo = FALSE)
```

This project aims to understand the differences in size between three penguin species (Adelie, Chinstrap and Gentoo) observed in the Palmer Archipelago, Antarctica, using data collected by Dr Kristen Gorman between 2007 and 2009.

## Distribution of body mass and flipper length

@fig-body-mass shows the distribution of body mass (in grams) across the three penguin species. We can see that, on average, Gentoo penguins are the heaviest, while Adelie and Chinstrap penguins are more similar in terms of body mass. Within each species, females are on average lighter than males.

```{r fig-body-mass}
#| fig-cap: "Distribution of penguin body mass (g) across species and sex."

# code for plot
```

Similarly, Gentoo penguins have the longest flippers on average (@fig-flipper-length), and Adelie penguins the shortest. Again, within each species, females have shorter flippers on average than males.

```{r fig-flipper-length}
#| fig-cap: "Distribution of penguin flipper length (mm) across species and sex."

# code for plot
```


## Association between bill length and depth

In this dataset, bill measurements refer to measurements of the culmen, which is the upper ridge of the bill. There is a clear relationship between bill length and depth, but it is masked in the dataset by differences between species (@fig-bill-scatterplot), with Gentoo penguins exhibiting longer but shallower bills, and Adelie penguins shorter and deeper bills.

```{r fig-bill-scatterplot}
#| fig-cap: "Scatterplot of penguin bill length and depth."

# code for plot
```

 

Writing a report – Quarto + {targets}

Two advantages of using a Quarto document alongside {targets}:

  • can read results from the {targets} pipeline inside the report: no computation is done during report generation

  • can add the rendering of the report as a step in the pipeline: ensures that the report is always up-to-date

Writing a report – Quarto + {targets}

reports/palmerpenguins_report.qmd
---
title: "Analysis of penguin measurements from the palmerpenguins dataset"
author: "Olivia Angelin-Bonnet"
date: today
format: docx
---

```{r setup}
#| include: false

library(knitr)

opts_chunk$set(echo = FALSE)
```

This project aims at understanding the differences...

## Distribution of body mass and flipper length

@fig-body-mass shows...

```{r fig-body-mass}
#| fig-cap: "Distribution of penguin body mass (g) across species and sex."

# code for plot
```

Two steps to use the {targets} pipeline results in a Quarto document:

Writing a report – Quarto + {targets}

reports/palmerpenguins_report.qmd
---
title: "Analysis of penguin measurements from the palmerpenguins dataset"
author: "Olivia Angelin-Bonnet"
date: today
format: docx
---

```{r setup}
#| include: false

library(knitr)
library(here)

opts_chunk$set(echo = FALSE)
opts_knit$set(root.dir = here())
```

This project aims at understanding the differences...

## Distribution of body mass and flipper length

@fig-body-mass shows...

```{r fig-body-mass}
#| fig-cap: "Distribution of penguin body mass (g) across species and sex."

# code for plot
```

Two steps to use the {targets} pipeline results in a Quarto document:

  • Make sure the report ‘sees’ the project root directory

Writing a report – Quarto + {targets}

reports/palmerpenguins_report.qmd
---
title: "Analysis of penguin measurements from the palmerpenguins dataset"
author: "Olivia Angelin-Bonnet"
date: today
format: docx
---

```{r setup}
#| include: false

library(knitr)
library(here)
library(targets)

opts_chunk$set(echo = FALSE)
opts_knit$set(root.dir = here())
```

This project aims at understanding the differences...

## Distribution of body mass and flipper length

@fig-body-mass shows...

```{r fig-body-mass}
#| fig-cap: "Distribution of penguin body mass (g) across species and sex."

tar_read(body_mass_plot)
```

Two steps to use the {targets} pipeline results in a Quarto document:

  • Make sure the report ‘sees’ the project root directory

  • Read targets objects with targets::tar_read()

Writing a report – Quarto + {targets}

Adding the Quarto report as a step in the pipeline (requires {tarchetypes} and {quarto}):

analysis/_targets.R
library(targets)
library(tarchetypes)
library(here)

source(here("R/helper_functions.R"))

list(
  tar_target(
    penguins_raw_file, 
    here("data/penguins_raw.csv"), 
    format = "file"
  ),
  tar_target(
    penguins_df, 
    read_data(penguins_raw_file)
  ),
  # etc
  tar_quarto(
    report, 
    here("reports/palmerpenguins_report.qmd")
  )
)

     

{targets} – how to reproduce the analysis?

README.md
## How to reproduce the analysis

```{r}
# Install the necessary packages
renv::restore()

# Run the analysis pipeline
targets::tar_make()
```

Thank you for your attention!

olivia.angelin-bonnet@plantandfood.co.nz

Presentation disclaimer

Presentation for
Event, location, Country, Event date

Publication data:
Authors month year. Presentation title. SPTS No. XXXXX.

Presentation prepared by:
Author name
Scientist, Group
Month Year

Presentation approved by:
SGL
Science Group Leader, Group
Month Year

For more information contact:
Author name
DDI: +64 X XXX XXXX
Email: firstname.lastname@plantandfood.co.nz

 

This report has been prepared by The New Zealand Institute for Plant and Food Research Limited (Plant & Food Research).
Head Office: 120 Mt Albert Road, Sandringham, Auckland 1025, New Zealand, Tel: +64 9 925 7000, Fax: +64 9 925 7001.
www.plantandfood.co.nz

 

DISCLAIMER

The New Zealand Institute for Plant and Food Research Limited does not give any prediction, warranty or assurance in relation to the accuracy of or fitness for any particular use or application of, any information or scientific or other result contained in this presentation. Neither The New Zealand Institute for Plant and Food Research Limited nor any of its employees, students, contractors, subcontractors or agents shall be liable for any cost (including legal costs), claim, liability, loss, damage, injury or the like, which may be suffered or incurred as a direct or indirect result of the reliance by any person on any information contained in this presentation.

© COPYRIGHT (202X) The New Zealand Institute for Plant and Food Research Limited. All Rights Reserved. No part of this report may be reproduced, stored in a retrieval system, transmitted, reported, or copied in any form or by any means electronic, mechanical or otherwise, without the prior written permission of The New Zealand Institute for Plant and Food Research Limited. Information contained in this report is confidential and is not to be disclosed in any form to any party without the prior approval in writing of The New Zealand Institute for Plant and Food Research Limited. To request permission, write to: The Science Publication Office, The New Zealand Institute for Plant and Food Research Limited – Postal Address: Private Bag 92169, Victoria Street West, Auckland 1142, New Zealand; Email: SPO-Team@plantandfood.co.nz.